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A Bayesian Nonparametric Causal Model for Regression Discontinuity Designs 


1 Background, Purpose and Novelty of Study 

The regression discontinuity (RD) design (Thistlewaite & Campbell, 1960; Cook, 2008) pro- 
vides a framework to identify and estimate causal effects from a non-randomized design. Each 
subject of a RD design is assigned to the treatment (versus assignment to a non-treatment) 
whenever her/his observed value of an the assignment variable equals or exceeds a cutoff 
value. The RD design provides a "locally-randomized experiment" under remarkably mild 
conditions, so that the causal effect of treatment outcomes versus non-treatment outcomes 
can be identified and estimated at the cutoff (Lee, 2008). Such effect estimates are similar 
to those of a randomized study (Goldberger, 2008/1972). As a result, since 1997, at least 74 
RD-based empirical studies have emerged in the fields of education, political science, psy- 
chology, economics, statistics, criminology, and health science (see van der Klaauw, 2008; 
Lee & Lemieux, 2010; Bloom, 2012; Wong et al. 2013; Li et ah, 2013). Polynomial and 
local linear models are standard for RD designs (Bloom, 2012; Imbens & Lemieux, 2008). 
However, these models can produce biased causal effect estimates, due to the presence of 
outliers of treatment outcomes; and/or due to incorrect choices of the bandwidth parameter 
for the local linear model. Currently, the correct choice of bandwidth has only been justi- 
fied by large-sample theory (Imbens & Kalyanaraman, 2012), and the local linear model for 
quantile regression (Frandsen et al., 2012) suffers from the "quantile crossing" problem. 

We introduce a novel formulation of our Bayesian nonparametric regression model (BLIND, 
2012), which provides causal inference for RD designs. It is an infinite-mixture model, that 
allows the entire probability density of the outcome variable to change flexibly as a function 
of the assignment variable. Moreover, our Bayesian model can provide inferences of causal 
effects, in terms of how the treatment variable impacts the mean, variance, a quantile, distri- 
bution function, probability density, hazard function, and/or any other chosen functional of 
the outcome variable. Moreover, the accurate causal effect estimation relies on a predictively- 
accurate model for the data. The Bayesian nonparametric regression model attained best 
overall predictive performance, over many real data sets, compared to many other regression 
models (BLIND, 2012). Finally, we will illustrate our Bayesian model through the causal 
analysis of two real educational data sets. 

2 Identifying Causal Effects in a RD Research Design 

In a RD design, let R, be a continuous- valued assignment variable (Berk & Rauma, 1983) 
having a known cutoff ro, for each subject i. Then in such a design, the treatment assignment 
mechanism is defined by A = l(Ri > r 0 ), with a realization denoted by = 1 { , r l > r 0 ), 
where l(-) is the indicator function. In the sharp RD design (Thistlewaite & Campbell, 
1960), the treatment receipt probability function is defined by Pr(T — 1\R — r) — 1 (r > r 0 ), 
and thus it has a discontinuous jump of 1 at r o- In a fuzzy RD design (Trochim, 1984), the 
probability function Pr(T = 1 1 R = r) has a discontinuous jump that is smaller than 1, at 
r 0 . This smaller jump is a result of imperfect treatment compliance, which can occur in 
settings where the assignment variable R measures the eligibility to receive a treatment, and 
some ineligible subjects (with R , < r 0 ) decided to receive treatment (i.e. , R = 1), and some 
eligible subjects (with A* > r 0 ) decided receive the non-treatment (i.e., R = 0). 

For each subject of a given RD study, indexed by i — 1, . . . , n, let R(A ro 1 ) = 1 indicate 
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receipt of the treatment, and let Tj(A ro ’ ) = 0 indicates receipt of the non-treatment, when 
assigned treatment Aff d e {t = 0, 1}. Also, denote Yj(Alf n \ T(A ( r ^ n> )) as the 2 2n potential 
outcomes to treatments that defined a common time point, for all R n = (A],..., R n ) r , 
Ai y = (A^\ . . . , A^" } ) T , and all T(A<f">) = (T^A^h • • • , T n (A ( * n) )) T (e.g, Angrist, 
et al., 1996). Now, suppose that data of a RD design satisfies the following five assump- 
tions: RD: the existence of limits lim r | ro E(T|r) ^ lim r | ro E(Xjr) (Hahn, et al. 2001); Local 

SUTVA (LS): T(a£ f” o) )) = y*(A<f } , T^A™)) for all n 0 subjects with n near 

r 0 (Cattaneo et al. 2013); Local Exclusion Restriction (LER): T(A^ n °' ) )) = 

Lj(A ro n ° , T(A ro "° )) for all (R no ,R no ) and for all no subjects with r,; near r 0 (e.g., Angrist, 
et al., 1996); Local Monotonicity (LM): T i (Aro 0+c ' > ) > TpA^ff c ' > ) for some e > 0 and for every 
subject i with r t e (r 0 — e, r 0 + e) (Hahn, et al. 2001); Local Randomization (LR): Each 
subject, indexed by w, has "imprecise control" over R, i.e. , F n (r\w) = Pr(f? < r\w) is contin- 
uous in r at ro, with 0 < Fn(ro\w) < 1 (Lee, 2008). Then, for the subgroup of subjects with 
assignment variables r t near ro, a compiler is a subject with (T)(l), T)(0)) = (0, 1); the 2 2n ° 
potential outcomes Y t ( Arif" 0 \ T(A { * no) )) reduce to two potential outcomes Yi(t),t = 0,1; 
and then T(l) — Y ( 0 ) is a causal effect of T on Y; and for any functional h { • } of Y, an 
estimator of the causal effect of T on h{Y}, at the cutoff r 0 , is given by: 


r = E[/r{Tj(l)} — /i{Y)(0)}|r 0 and i is a compiler] 


linypo E[h{T}l r ] ~ tinypo E[h{Y}\r] 
Pr[i is a complieijro] 


( 1 ) 

(Imbens & Lemieux, 2008). Depending on the choice of h{-}, the causal effect estimator 
(1) describes how much the treatment T impacts either the mean, variance, distribution 
function, a quantile, probability density, or any other feature of the outcome h\Y}. The 
denominator of (1) is identical to lim r | ro E[T|r] — lim r f ro E[T|r]; and a sharp RD design has 
Pr[i is a complieijr 0 ] = 1, and trivially satisfies assumptions RD, LER, and LM. Also, the 
two data sets that will analyzed in the the present study, satisfies assumption LR, because 
arguably for each data set, each subject has imprecise control over the assignment variable. 


3 A Statistical (Causal) Model for RD Designs 


For the sharp RD design, our Bayesian nonparametric model is given by: 

OO 

f(ViVh < i} ) = E n (f/*l/R, tf)uj(v u (ri),<r u ( r i)), i = (2a) 

j=- oo 

UjiVuir), <rM) = $(0‘ - - $({j - 1 - vjr)}/a u (r)) (2b) 

V u (r) = Poc + Eh + P^rJ (2c) 

<T u (r) = exp(A a)0 + A w ir + A W 24 o } ) 1/2 (2d) 

( fij,a j) ~ normal (/Tj | / t m , CT 2 )inverseGamma(of |1, b a ) (2e) 

(/V°m) ~ normal(/i M |/i 0 ,ao)uniform(cT M |0,6 (TM ) (2f) 

(b a ,(3 u , A w ) ~ gamma(6 CT |a 0 ,&o)normal(/3 u ,, A w |0,tT p+ i) (2g) 
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where $(•) is the Normal(-|0, 1) c.d.f.; and the mixture weights tch'( 7 7u>( r ), cr u (r)) sum to 1 at 
every value of r. Our model (2) has infinite-dimensional parameter T = ((p, cr^)°fi_ 0C , /r , 

cr^, b a , fl^, A w ), and the degree of multimodality of f(y\r, ai r J) depends on the size of cr w (r) 
(BLIND, 2012). Obviously, there is a discontinuity in the regression at ro, when either of 
the parameters (/3 w2 > ^2) is non-zero. Moreover, when prior information is limited, we may 
specify vague prior hyper-parameters /i 0 = 0, cr^ — 1 > 00, clq — > 0, 60 — ► 0, and v = 10 5 , along 
with a choice of prior parameter b aix that reflects prior knowledge about the range of Y. 

A set of data T> n = {y t , r t , af' 0 ^}f =i updates the prior density 7r(r) to a posterior den- 
sity, given by 7r(r|X> n ) oc ]Jf =1 f(yi\ri,ai r 0 ^;T)7r(T) up to a proportionality constant. Then 
E n (y|r, ai r J) — /|/ yf{y\r,a^\T)dy^dIi(Y\D n ) gives the posterior predictive expectation 

of Y conditionally on (r, (if; ) ■ If all five assumptions hold for the sharp RD design, then a pos- 
terior estimate of the causal effect of T on Y is given by r/ t = E n (h{y}\r 0 , 1) — E n (h{y}\ro, 0), 
for any choice of functional h { •} . Existing Markov Chain Monte Carlo (MCMC) methods 
can be used to estimate the posteriors 7r(r|2>„) and E n (h{y}\r, ai r J) (BLIND, 2012). 

Our causal model (2) can be extended to a fuzzy RD design, where it is only known that 
Pr[i is a compiler |r 0 ] < 1. This extension involves the estimation of bounds of the causal 
effects T/i, over a plausible ranges of Pr[z is a compiler |r 0 ] and of LM and ER violation magni- 
tudes (Angrist et ah, 1996). It is prudent to estimate such bounds, because in the fuzzy RD 
design, the identifying LM and LER assumptions are empirically falsifiable and unverifiable, 
and because the estimation of Pr[i is a compiler |r 0 ] is also empirically unverifiable because 
the estimator (1) does not identify the compilers (e.g., Balke & Pearl, 1997). 

4 Applicability of Model; Setting; Interventions; Subjects; Data Collection and 

Analysis; Results; Conclusions 

Two data sets were collected under a partnership between four Chicago university schools of 
education, which implemented a new curriculum that aims to train and graduate teachers 
to improve Chicago public school education. Using Windows-based software developed by 
the first author, we analyzed each of the two data sets using the Bayesian model, specified 
by the vague priors mentioned earlier, along with prior specification b aiJ = 5. All posterior 
estimates, reported below, are based on 40K MCMC samples, which led to accurate posterior 
estimates according to standard convergence assessments (Geyer, 2011). 

For the first data set, the aim is to estimate the effect of the new teacher education cur- 
riculum on math teaching ability, among undergraduate teacher education students attending 
one of four Chicago universities. This data set involves a sharp RD design, specifically an 
interrupted time-series design (Cook & Campbell, 1979, Ch. 5), with the assignment variable 
of time, ranging from fall semester 2007 through spring semester 2013. The new curriculum 
(treatment) was instituted in Fall 2010 (the cutoff, r 0 ), and the old teacher curriculum (non- 
treatment) was active before that time. The outcome variable is the number-correct score 
on the 25-item Learning Math for Teaching (LMT) test (University of Michigan). The LMT 
score was obtained from each student, who had just completed a course on teaching algebra. 
A total of n = 347 students completed the LMT test (89.9% female; 135 and 212 students 
under the old and new curriculum). Among these students, the Cronbach’s alpha reliability 
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of the LMT test score is .63, and the average LMT score was 12.9 (s.d.= 3.44). 

Using our Bayesian model, we analyzed the data to estimate the effect of the new cur- 
riculum, versus the old curriculum, on student ability to teach math (LMT score), at the Fall 
semester 2010 cutoff. The model included the LMT test z-score as the outcome (dependent) 
variable, and included covariates of the assignment variable TinreFlO = (Year — 2010.9)/10 
and of the treatment assignment variable CTPP = Aiol'oif' * = 1 (TinreFlO > 0), with time 
2010.9 referring to the Fall 2010 cutoff. Our model displayed good fit to these data. The 
standardized residuals ranged from —0.84 to 0.77 over the 347 observations, and R-squared 
was .92. Figure 1 presents the model’s posterior predictive density estimate of the LMT 
outcome, for the new curriculum (treatment) and for the old curriculum (non-treatment), 
at Fall 2010. As shown, the new curriculum, compared to the old curriculum, increased 
the LMT scores, in terms of shifting the density of LMT scores to the right. This shift 
corresponds to an increase in the mean (from .17 to .20), the 10%ile ( — 1.43 to —1.35), the 
median (.07 to .15), and corresponds to a variance decrease (1.78 to 1.69). 

The second data set, from another sharp RD design, involves n = 205 undergraduate 
teacher education students, each of whom enrolled into one of the four Chicago schools 
of education during either the year of 2010, 2011, or 2012 (90% female; mean age=22.5, 
s.d.=5.35, n = 203); 47%, 21%, 10%, and 22% attended the four universities; 49%, 41% and 
10% enrolled in 2010, 2011, and 2012). It is of interest to investigate the causal effect of 
basic skills on teacher performance (e.g., Gitomer & Brown, 2011), because most U.S. schools 
of education based their undergraduate admissions decisions on the ability of individual 
applicants to pass basic skills tests. Here, the assignment variable is defined by the score on 
an Illinois test of reading basic skills, with minimum cutoff passing score of 240. The outcome 
variable is the total score on the 50-item Haberman (2008) Teacher Pre-screener assessment, 
which has a test-retest reliability of .93, and has a 95% accuracy rate in predicting which 
teachers will stay and succeed in the teaching profession (Haberman, 2008). A score in the 
40-50 range indicates a very effective teacher, and many schools use the Haberman Pre- 
screener to assess applicants of teaching positions. Among all the 205 students of the RD 
design, the average reading basic skills score is 204.69 (s.d.=33.7); and the average Haberman 
Pre-screener score is 29.82 (s.d.=4.32). 

Using the Bayesian model, we analyzed the data set to estimate the causal effect of 
passing the reading basic skills exam (treatment), versus not passing (non-treatment), on 
students’ ability to teach in urban schools. The model included the Haberman z-score 
as the outcome (dependent) variable, and included covariates of the assignment variable 
Rd240 = (Read — 240) /10 and the reading (Read) score passing (assignment) indicator 
ReadPass = = l(Read > 240). Our model fit the data well. The standardized 

residuals ranged from —1.7 to 1.22 over the 205 observations, and R-squared was .98. Figure 
2 presents the model’s posterior predictive density estimates, for treatment versus non- 
treatment. A detailed inspection revealed that passing the basic skills reading test causally 
increased the Haberman z-score, in terms of the mean (from .13 to .26), median (.05 to .28), 
75%ile (.97 to 1.43), 90%ile (1.66 to 2.21), 95%ile (2.13 to 2.82), and variance (1.60 to 3.36); 
and causally decreased the 5%ile (—1.74 to —2.38) and 10%ile (—1.30 to —1.70). Also, the 
treatment density and the non-treatment density each has two modes (clusters) of students, 
with below-average and above-average Haberman z-scores, respectively. 
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Appendix B: Figures 


CTPP = 0 (Blue) vs. 1 (Red) 
TimeFlO = 0 



ZjDosttest 


Figure 1: For the LMT z-score outcomes, the posterior predictive density estimates of F(l) 
(red) and of F(0) (blue). 


ReadPass = 0 (Blue) vs. 1 (Red) 
Rd240 = 0 



Z_haberman 


Figure 2: For the Haberman z-score outcomes, the posterior predictive density estimates of 
Y{ 1) (red) and ofF(O) (blue). 
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